Members:
CSE 5324: Shenxiao Mei, Samuel Lefcourt
CSE 7324: Chenming Cui, Zeqi Guo
The transfers of athletes always play a huge part in the modern soccer industry. To some soccer clubs, it is a way to improve their performance in their leagues; to others, it might be a way to make profits. This dataset records the top 250 most expensive player transfers each year starting from the 2000-2001 season.
Player transfers are crucial decisions that usually affect a soccer club's future. Clubs now spend large amounts of money on scouting potential players in the market. However, since soccer is a highly commercial business, every team has a budget that limits the number and quality of players it can target. A mid-level team may not be able to afford the transfer fee for a superstar such as Kylian Mbappé, so it has to wait until the transfer fee goes down. Therefore, our goal is to assist teams in predicting when a player's trading value is less than his market value.
Soccer transfers are fluid. For example, Mohamed Salah's market value rose to 150 million euros after he was transferred for only 42 million euros. Thus, our second goal is to compare a player's market value with his transfer fee and make predictions so that a team manager can wait for the best moment to make a transfer. This will be especially helpful for teams with a relatively limited budget, since it helps them find the optimal transfer choices.
Our algorithm will be used to estimate the probability of a surplus transfer fee; that is, a transfer fee larger than the market value. In the real world, soccer teams and the players themselves will be interested in our prediction as third parties. Clubs will be able to identify whether they can benefit from a transfer based on our prediction, and players will be able to evaluate themselves with it. We can consider a soccer transfer an "investment". Just like any other financial investment, players are like stocks: you either make or lose money. To measure our success, we use statistics from a real economic market. According to https://www.financial-math.org/blog/2017/03/how-accurate-are-market-forecasters/, a top-ranking market forecaster was 78.7% accurate in determining the state of the market. The next best forecasters had 72.5%, 71.8% and 70.5% accuracy scores. It is true that soccer cannot be perfectly quantified, but transfers form a highly mature market system. Each club has its own private scouting team, where each scout can have an average salary of $36,000 per year. Modern soccer scouts assess a player not only by watching him, but also by doing data analysis. Our prediction is free and uses data only, so we feel that 70% prediction accuracy is a reasonable success criterion.
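As a rough illustration of how the 70% criterion could be checked, the sketch below computes plain classification accuracy on binary labels (1 = transfer fee above market value). The label arrays and the helper name `accuracy` are hypothetical and are not part of the notebook; no model has been trained here.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels: 1 = fee above market value, 0 = otherwise.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1, 1, 0]

SUCCESS_THRESHOLD = 0.70  # the success criterion stated in the text
acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}, meets criterion: {acc >= SUCCESS_THRESHOLD}")
```

Accuracy alone can be misleading if one class dominates, so a real evaluation would probably also report the share of premium transfers in the test data.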
Below we introduce all the attributes in our dataset.
import pandas as pd
import numpy as np
print('Pandas:', pd.__version__)
print('Numpy:',np.__version__)
df = pd.read_csv('dataset/top250-00-19.csv') # read in the csv file
df.info()
df.head()
df
We check for duplicate transfer records by looking for duplicates on the subset of columns called dup, which consists of the name, season and new team. According to the results, there are no duplicate records in the data.
dup=['Name','Season','Team_to']
print(df.duplicated(subset=dup, keep=False))
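Printing the full boolean Series can be hard to read on a large table. A minimal sketch of summarizing the check as a single count, using a toy frame with hypothetical values (only the column names follow the dataset):

```python
import pandas as pd

# Toy data with hypothetical values; column names follow the dataset.
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player A', 'Player C'],
    'Season': ['2017-2018', '2017-2018', '2017-2018', '2018-2019'],
    'Team_to': ['Club X', 'Club Y', 'Club X', 'Club X'],
})
dup = ['Name', 'Season', 'Team_to']

# keep=False marks every member of a duplicated group, not just the repeats.
mask = toy.duplicated(subset=dup, keep=False)
n_duplicated_rows = int(mask.sum())

print(n_duplicated_rows)  # number of rows that share a (Name, Season, Team_to) key
print(toy[mask])          # the offending rows, for inspection
```

A count of zero on the real data would support the statement that there are no duplicate records.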
Once we have finished checking for duplicates, we delete unnecessary columns. In our case, we delete the 'Name' column because it is unlikely to have an impact on the transfer fee or transfer frequency.
if 'Name' in df:
    del df['Name']
df
When we examine the quality of the dataset, we find that the 'Position' column has some vague values, as shown below. Look at the positions that occur only once in the entire dataset. These vague values could be caused by human error or incomplete records. Because there is only one player in each of these positions, we do not have enough data to make a prediction for them.
df_grouped=df.groupby(by=['Position'])
for val, grp in df_grouped:
    print(val, ' ', len(grp))
df.info()
df = df[df.Position != 'Forward']
df = df[df.Position != 'Defender']
df = df[df.Position != 'Sweeper']
df = df[df.Position != 'Midfielder']
df.info()
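The four filters above name the rare positions explicitly. As an alternative sketch, the same kind of cleanup can be driven by the counts themselves, so that no labels need to be hard-coded. The toy data below is hypothetical, and the threshold of one occurrence mirrors the reasoning in the text.

```python
import pandas as pd

# Hypothetical toy data; only the column name follows the dataset.
toy = pd.DataFrame({'Position': ['Centre-Back', 'Centre-Back',
                                 'Left Winger', 'Left Winger',
                                 'Sweeper', 'Forward']})

counts = toy['Position'].value_counts()
rare = counts[counts == 1].index          # positions that occur exactly once
filtered = toy[~toy['Position'].isin(rare)]

print(sorted(rare))
print(len(filtered))
```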
Although there are no duplicates, we noticed that a significant amount of data is missing in the market value column. The missing data mostly occurs in the very early seasons. An incomplete player assessment system might be responsible for this: twenty years ago, there may not have been an accurate system to evaluate a player's market value based on his stats. Some data is also missing in more recent years, because some players did not have a big reputation or only played for a short time, so their data could not be collected successfully.
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import missingno as mn
mn.matrix(df.sort_values(by='Season'))
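The missingness matrix gives a visual impression. A small sketch of how the share of missing market values per season could be quantified is shown below; the toy values are hypothetical and only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names follow the dataset.
toy = pd.DataFrame({
    'Season': ['2000-2001', '2000-2001', '2000-2001', '2018-2019', '2018-2019'],
    'Market_value': [np.nan, np.nan, 5e6, 2e7, np.nan],
})

# The mean of a boolean mask is the fraction of True values, here per season.
missing_share = toy['Market_value'].isna().groupby(toy['Season']).mean()
print(missing_share)
```

On the real data, a table like this would show directly whether the early seasons carry most of the missing values.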
The rows that have NaN in their market value are dropped because these values cannot be reliably imputed.
df = df[np.isfinite(df['Market_value'])]
mn.matrix(df.sort_values(by=['Season']))
With all rows containing missing data dropped, we can now calculate the premium rate based on market value and transfer fee. This additional column is to help describe the gain and loss of the transfer.
df['premium_rate'] = (df.Transfer_fee/df.Market_value)
df.head()
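To make the premium rate concrete, here is a small worked sketch with hypothetical fees and market values. A rate above 1 means the buyer paid more than the market value, which is the same threshold the notebook later uses for the `isPremium` flag.

```python
import pandas as pd

# Hypothetical values in euros; column names follow the dataset.
toy = pd.DataFrame({
    'Transfer_fee': [30e6, 10e6, 20e6],
    'Market_value': [20e6, 20e6, 20e6],
})

toy['premium_rate'] = toy['Transfer_fee'] / toy['Market_value']
# Strictly greater than 1: a fee equal to the market value is not a premium.
toy['isPremium'] = (toy['premium_rate'] > 1).astype(int)
print(toy)
```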
plt.style.use('ggplot')
plt.figure(figsize=(20, 10))
plt.subplot(1,2,1)
plt.xlabel('Transfer_fee')
plt.ylabel('Number')
plt.title('Transfer frequency vs Transfer fee', fontsize=18)
df.Transfer_fee.plot.hist(bins=100, color='#607c8e',logx=True)
plt.subplot(1,2,2)
plt.xlabel('Age')
plt.ylabel('Number')
plt.xlim([15,36])
plt.title('Transfer frequency vs Age', fontsize=18)
df.Age.plot.hist(bins=20, color='#607c8e')
df.Age.plot.kde(bw_method=0.2, secondary_y=True)
The two plots are just as one would expect. A player's transfer fee is usually around 10 million to 30 million euros, because that is what a mid-level team can afford to pay for a single player. Notice the single bar around 200 million: that is Neymar's transfer, the most expensive transfer in soccer history. However, that does not happen very often. On the right side we have a plot of transfer frequency versus age. 25-year-old players are transferred most often because they are more mature than younger players and have fewer injuries than older players. They are also experienced and know how to play in a system.
# df_new=df.groupby('League_from').size()
df_League=df[df['League_from'].groupby(df['League_from']).transform('size')>20].groupby('League_from').size()
df_League.plot(kind='barh', fontsize=14, figsize=(10,6))
plt.xlabel('Number of top 250 transfer', fontsize=14)
plt.title('League that has more than 20 top 250 transfer', fontsize=18)
Not surprisingly, Serie A from Italy, Premier League from England, Ligue 1 from France, LaLiga from Spain and 1.Bundesliga from Germany frequently participate in trades involving high transfer fees. These five leagues have the highest audience ratings and the most complete soccer systems, with canteras (a term used in Spain to refer to youth academies and farm teams) and policies. In the past decade we have seen Serie A and Premier League compete for the top rank in number of transfers. The picture below is provided by https://www.sbnation.com/soccer/2014/7/28/5923187/transfer-window-soccer-europe-explained
df_Position_freq=df.groupby('Position').size()
df_Position_freq.plot(kind='barh', fontsize=14, figsize=(10,6))
plt.xlabel('Number of top250 transfer', fontsize=14)
plt.title('Positions that have top 250 transfer', fontsize=18)
Without a doubt, the Centre-Forward, the main attacker, is the most important position on the field. Sometimes a good striker can help the team win the match single-handedly. On the other side of the field, Centre-Backs, who play directly against the opposing forwards, are also considered valuable. That's why Centre-Back is the second most frequent position in the top 250 transfers.
# df_Season_fee=df[['Season', 'Transfer_fee']]
# df_Season_fee=df_Season_fee.groupby('Season')['Transfer_fee'].mean()
df_Season_fee=df[df.Transfer_fee >30000000]
df_Season_fee=df_Season_fee.groupby('Season').size()
ax = df_Season_fee.plot(kind='bar',fontsize=14, figsize=(10,8),color='y')
plt.title('Transfers with a fee over 30 million')
plt.ylabel('Number of transfers', fontsize=14)
plt.show()
The number of transfers above 30 million euros is increasing over time, which suggests that player prices are rising. Note that 2018-2019 is much lower than 2017-2018. That's because Neymar's transfer significantly affected the market.
df['fee_range'] = pd.cut(df['Transfer_fee'], [0, 2e6, 2e7, 5e7, 5e8],
                         labels=['2millions', '20millions', '50millions', '500millions'])
df['Age_range'] = pd.cut(df['Age'], [16, 20, 23, 28, 35],
                         labels=['16-20', '20-23', '23-28', '28-35'])
df
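One detail of `pd.cut` is worth checking here: by default the bins are open on the left and closed on the right, so with the edges `[16, 20, 23, 28, 35]` an age of exactly 16 and any age above 35 receive no label (NaN). The sketch below demonstrates this on a hypothetical list of ages. Whether such ages occur in the dataset is not verified here.

```python
import pandas as pd

ages = pd.Series([16, 17, 20, 21, 28, 35, 36])  # hypothetical ages
bins = [16, 20, 23, 28, 35]
labels = ['16-20', '20-23', '23-28', '28-35']

# Default behaviour: intervals are (16, 20], (20, 23], (23, 28], (28, 35].
age_range = pd.cut(ages, bins, labels=labels)

# include_lowest=True also places the lowest edge (16) in the first bin.
with_lowest = pd.cut(ages, bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'Age': ages, 'default': age_range, 'include_lowest': with_lowest}))
```

Rows with a NaN range are silently dropped by later `groupby` calls, so it may be worth counting them before plotting.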
def conditions(s):
    # Collapse the detailed position label into one of three general groups.
    if s['Position'] in ('Second Striker', 'Centre-Forward', 'Right Winger', 'Left Winger'):
        return 'Forward'
    elif s['Position'] in ('Attacking Midfield', 'Left Midfield', 'Defensive Midfield', 'Right Midfield'):
        return 'Midfield'
    else:
        # Every other position falls through to 'Back'.
        return 'Back'
df['General_Position']=df.apply(conditions, axis=1)
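The same grouping can be sketched with set lookups and `Series.map`, which avoids a row-wise `apply`. The helper name `general_position` and the toy labels are hypothetical. Note the fall-through behaviour, which matches the function above: any label that is not listed ends up as 'Back', and that would include labels such as 'Goalkeeper' or 'Central Midfield' if they appear in the data (an assumption, not checked here).

```python
import pandas as pd

FORWARD = {'Second Striker', 'Centre-Forward', 'Right Winger', 'Left Winger'}
MIDFIELD = {'Attacking Midfield', 'Left Midfield', 'Defensive Midfield', 'Right Midfield'}

def general_position(position):
    """Map a detailed position label to Forward, Midfield or Back."""
    if position in FORWARD:
        return 'Forward'
    if position in MIDFIELD:
        return 'Midfield'
    return 'Back'  # everything not listed above, same as the original else branch

# Hypothetical labels, including two that are not listed in either set.
toy = pd.Series(['Centre-Forward', 'Left Midfield', 'Centre-Back',
                 'Goalkeeper', 'Central Midfield'])
general = toy.map(general_position)
print(general.tolist())
```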
df_new1 = df.groupby(by=['Age_range', 'General_Position'])[['Transfer_fee', 'Market_value']].mean()
ax = df_new1.plot(kind='barh',fontsize=14, figsize=(10,8))
plt.title('Average transfer fee and market value based on position')
plt.show()
From the above graph we can see that transfer fees are mostly higher than market values regardless of the player's position. This is usually caused by the liquidated damages clauses in player contracts. Forwards have the largest market values and transfer fees.
df_prem_age=df[['Age','premium_rate']].groupby('Age')['premium_rate'].mean()
fig = plt.figure(figsize=(15,5))
ax = df_prem_age.plot(kind='bar',fontsize=14, figsize=(15,8),color='k')
ax.axhline(y=1)
plt.title('Average Premium Rate based on Age')
plt.ylabel('Premium_rate')
plt.show()
If a bar does not reach the red line (a premium rate of 1), then the average transfer fee at that age is lower than the market value. It has become standard for soccer clubs to buy very young players with huge potential. In fact, it is not just potential that causes a high premium rate: a young player usually has a lower salary and can recover quickly if injured. When players reach a certain age, such as 34, their strength and stamina decrease, and clubs are less likely to buy older players at a price over market value.
df_league_mean = df[['League_from','premium_rate']].groupby('League_from').mean()
df_league = pd.DataFrame(df_league_mean)
rate = df_league['premium_rate'][['Serie A', 'Premier League', 'LaLiga', 'Ligue 1', '1.Bundesliga']]
new_rate = np.array(rate)
name = ['Serie A', 'Premier League', 'LaLiga', 'Ligue 1', '1.Bundesliga']
new_table = pd.DataFrame(new_rate,name)
ax = new_table.plot(kind='bar',fontsize=14, figsize=(10,8),color='y')
ax.axhline(y=1)
plt.title('Average Premium Rate based on League')
plt.xlabel('League name', fontsize=14)
plt.ylabel('average premium rate', fontsize=14)
plt.show()
The picture above shows the average premium rate of the top 5 leagues in this ranking. Ligue 1 has the highest premium rate because teams mainly purchase young players with great potential who are not yet worth the price paid. The Bundesliga has the lowest rate because it has a tradition of budget control, so its teams usually purchase players with a stable performance rather than those who may have high potential.
import seaborn as sns
cmap = sns.diverging_palette(220, 10, as_cmap=True)
fig = plt.figure(figsize=(12,5))
df['isPremium'] = np.where(df['premium_rate']>1, 1, 0)
sns.violinplot(x="fee_range", y="Age", hue="isPremium", data=df,
               split=True, inner="quart")
The violin plot above shows the fee range on the x-axis and the age distribution on the y-axis. Each violin is split by "isPremium", so the two halves compare transfers above market value with the rest.
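The violin plot shows shapes rather than numbers. As a complementary sketch, the share of premium transfers in each fee range can be tabulated directly. The toy rows below are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the notebook.
toy = pd.DataFrame({
    'fee_range': ['20millions', '20millions', '50millions',
                  '50millions', '50millions', '500millions'],
    'isPremium': [1, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 flag is the fraction of premium transfers in each group.
premium_share = toy.groupby('fee_range')['isPremium'].mean()
print(premium_share)
```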
# plot the correlation matrix
vars_to_use = ['Age','Market_value', 'Transfer_fee','premium_rate'] # pick vars
df_young_age=df[df.Age<25]
df_old_age=df[df.Age>31]
plt.figure(figsize=(30, 10))
plt.subplot(1,2,1)
plt.pcolor(df_young_age[vars_to_use].corr()) # do the feature correlation plot
# fill in the indices
plt.yticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.xticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.colorbar()
plt.subplot(1,2,2)
plt.pcolor(df_old_age[vars_to_use].corr()) # do the feature correlation plot
# fill in the indices
plt.yticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.xticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.colorbar()
plt.show()
Soccer transfers have a characteristic pattern: if a player is under 20, his market value tends to increase; if a player is above 30, his market value tends to decrease. Therefore, we made two plots. On the left is the correlation matrix for young players (under 25 in the code above); on the right is the correlation matrix for older players (over 31 in the code above). Since ages 26-30 are the golden years of a player, age has nothing to do with the market value in that range.
From the plots we see that market value and transfer fee are clearly positively correlated. As stated above, the older a player gets, the less market value he will have.
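Colors in a `pcolor` plot can be hard to read precisely, so it may help to print the coefficients themselves. The sketch below uses a hypothetical group of older players whose values are deliberately perfectly linear, so the coefficients come out at the extremes. Real data would give values somewhere in between.

```python
import pandas as pd

# Hypothetical older players; values chosen to be perfectly linear on purpose.
toy = pd.DataFrame({
    'Age': [32, 33, 34, 35],
    'Market_value': [8e6, 6e6, 4e6, 2e6],
    'Transfer_fee': [9e6, 7e6, 5e6, 3e6],
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))
```

Reading `corr.loc['Age', 'Market_value']` for each age group gives a single number that can be compared between the young and old subsets.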
import plotly
import plotly.graph_objs as go
import plotly.offline as pyoff
print('plotly', plotly.__version__)
plotly.offline.init_notebook_mode() # run at the start of every notebook
%%time
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris,load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
X = np.array(df[['Age','Market_value','Transfer_fee']])
tsne = TSNE(n_components=2)
tsne.fit_transform(X)
print(tsne.embedding_)
%%time
tsne = TSNE(n_components=2, init='pca', random_state=0)
Y = tsne.fit_transform(X)
colors1 = X[:, 0]
fig = go.Figure()
fig.add_scatter(x=Y[:, 0],
                y=Y[:, 1],
                mode='markers',
                marker={'color': colors1,
                        'opacity': 1.0,
                        'colorscale': 'Viridis'
                        });
# pyoff.iplot(fig)
plotly.offline.iplot(fig)
t-SNE is a machine learning algorithm for visualizing high-dimensional data. It is a good way to reduce many dimensions to two or three. From our dataset, we selected the attributes "Age", "Market_value" and "Transfer_fee". After running t-SNE, we used plotly to visualize the result and colored the graph according to each instance's "Age" attribute. Because the x and y axes of a t-SNE plot have no intrinsic meaning and t-SNE is just a visualization technique, what matters is the distribution of the points (see https://stats.stackexchange.com/questions/254090/what-are-the-axes-of-a-t-sne-scatterplot). In our resulting graph, the points at the upper boundary are purple, which corresponds to younger athletes, while green corresponds to older athletes. This solidifies our belief that athletes' ages truly affect their trading value.
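One caveat when reading this embedding: scikit-learn's t-SNE works from pairwise distances (Euclidean by default), and the three selected columns live on very different scales, with ages in the tens and fees in the millions. The sketch below, on hypothetical rows, shows how a z-score standardization changes which differences dominate the distance. This is a possible preprocessing step under that assumption, not something the notebook above does.

```python
import numpy as np

# Hypothetical rows: [Age, Market_value, Transfer_fee]
X_toy = np.array([
    [19, 10e6, 12e6],
    [33, 10e6, 12e6],   # same values as row 0, but 14 years older
    [19, 11e6, 12e6],   # same age as row 0, market value 1 million higher
    [25, 40e6, 45e6],
    [28,  5e6,  4e6],
], dtype=float)

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw scale: the 1 million euro gap dwarfs the 14-year age gap.
d_age_raw = dist(X_toy[0], X_toy[1])
d_value_raw = dist(X_toy[0], X_toy[2])

# z-score standardization: every column gets mean 0 and unit variance.
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
d_age_std = dist(X_std[0], X_std[1])
d_value_std = dist(X_std[0], X_std[2])

print(d_age_raw, d_value_raw)
print(round(d_age_std, 2), round(d_value_std, 2))
```

On the raw columns, age contributes almost nothing to the distances, so any age pattern in the plot mostly reflects how age co-varies with the monetary columns. Feeding a standardized array to `TSNE` would let age influence the embedding directly.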
colors2 = X[:, 2]
fig2 = go.Figure()
fig2.add_scatter(x=Y[:, 0],
                 y=Y[:, 1],
                 mode='markers',
                 marker={'color': colors2,
                         'opacity': 1.0,
                         'colorscale': [[0,'green'],[0.2,'red'],[0.4,'yellow'],[0.6,'blue'],[1.0,'pink']]
                         });
# pyoff.iplot(fig2)
plotly.offline.iplot(fig2)
This graph is colored by the "Transfer_fee" attribute. Most of the graph is green and red, which represent lower fees. Only a few points are colored pink and blue. Thus, only a few athletes command a high transfer fee.
colors3 = np.array(df['isPremium'])
fig3 = go.Figure()
fig3.add_scatter(x=Y[:, 0],
                 y=Y[:, 1],
                 mode='markers',
                 marker={'color': colors3,
                         'opacity': 1.0,
                         'colorscale': 'Viridis'
                         });
# pyoff.iplot(fig3)
plotly.offline.iplot(fig3)
This graph is colored by the "isPremium" attribute. "isPremium" is a binary value (0 or 1), so the graph is separated into two parts.
When we look at these three graphs together, we find that most young athletes do not command high transfer fees, but their transfer fees tend to be higher than their market values. The older an athlete is, the less likely his transfer fee is to exceed his market value. What's more, the highest transfer fees go to athletes between 25 and 30 years old.